55 research outputs found

    Reducing branch delay to zero in pipelined processors

    A mechanism to reduce the cost of branches in pipelined processors is described and evaluated. It is based on the use of multiple prefetch, early computation of the target address, delayed branch, and parallel execution of branches. The implementation of this mechanism using a branch target instruction memory is described. An analytical model of the performance of this implementation makes it possible to measure the efficiency of the mechanism with a very low computational cost. The model is used to determine the size of cache lines that maximizes the processor performance, to compare the performance of the mechanism with that of other schemes, and to analyze the performance of the mechanism with two alternative cache organizations. Peer Reviewed. Postprint (published version)

    Evaluating A+B=K conditions in constant time

    The authors consider a type of condition that can be evaluated without requiring a complete ALU (arithmetic logic unit) operation. The circuit that is presented detects the condition A+B=K (n-bit numbers) in constant time, avoiding the carry propagation delay. This circuit can be used to detect a wide spectrum of conditions in branch instructions. It can improve the processor performance by advancing the evaluation of conditions and eliminating the pipeline delays produced by these operations. Peer Reviewed. Postprint (published version)
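
A software sketch can show why the condition A+B=K needs no carry chain. The formulation below is one common bitwise test and is not necessarily the circuit the authors present; the function name `sum_equals` and its parameters are assumptions made for illustration. Each bit position computes the carry-in it would need for its sum bit to match K, and the carry-out it would then produce. The condition holds exactly when the two patterns line up, and each position only looks at itself and its neighbor.

```python
def sum_equals(a: int, b: int, k: int, n: int) -> bool:
    """Return True when (a + b) mod 2**n == k, without performing the addition."""
    mask = (1 << n) - 1
    a, b, k = a & mask, b & mask, k & mask
    # Carry-in each bit position would need for its sum bit to equal k's bit.
    required_carry_in = a ^ b ^ k
    # Carry-out each position produces, given that its sum bit equals k's bit.
    implied_carry_out = (a & b) | ((a | b) & ~k)
    # Carry out of bit i must match the carry required into bit i+1;
    # bit 0 must require no carry-in at all.
    return required_carry_in == ((implied_carry_out << 1) & mask)
```

In hardware terms, every line above is a single level of per-bit gates followed by a wide equality check, so the depth does not grow with a ripple of carries. Note that the sketch tests equality modulo 2**n, that is, it ignores the carry out of the top bit.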

    Computing size-independent matrix problems on systolic array processors

    A methodology to transform dense matrices into band matrices is presented in this paper. This transformation is accomplished by partitioning into triangular blocks, and it allows problems of any given size to be solved by means of contraflow systolic arrays, originally proposed by H.T. Kung. Matrix-vector and matrix-matrix multiplications are the operations considered here. The proposed transformations allow the optimal utilization of the processing elements (PEs) of the systolic array when operating on dense matrices. Every computation is made inside the array by using adequate feedback. The feedback delay time depends only on the systolic array size. Peer Reviewed. Postprint (published version)

    Vectorized register tiling

    In recent years, there has been much effort in commercial compilers (icc, gcc) to efficiently exploit the SIMD capabilities and the memory hierarchy that current processors offer. However, the few compilers that can automatically exploit these characteristics achieve unsatisfactory results in most cases. Therefore, programmers often need to apply the optimizations to the source code by hand, write the code manually in assembly, or use compiler built-in functions (such as intrinsics) to achieve high performance. In this work, we present source-to-source transformations that help commercial compilers exploit the memory hierarchy and generate efficient SIMD code. Results obtained in our experiments show that our solutions achieve performance as good as that of hand-optimized vendor-supplied numerical libraries (written in assembly). Peer Reviewed. Preprint

    Conflict-free strides for vectors in matched memories

    Address transformation schemes, such as skewing and linear transformations, have been proposed to achieve conflict-free access to one family of strides in vector processors with matched memories. The paper extends these schemes to achieve this conflict-free access for several families. The basic idea is to perform an out-of-order access to vectors of fixed length, equal to that of the vector registers of the processor. The hardware required is similar to that for the access in order. Peer Reviewed. Postprint (author's final draft)
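
To see why some strides conflict in the first place, it can help to look at the baseline that such schemes improve on. The sketch below assumes plain low-order interleaving (module index = address mod number of modules), with no skewing and no out-of-order access, so it is not the scheme proposed in the paper. The function name `modules_touched` and its parameters are hypothetical.

```python
from math import gcd


def modules_touched(stride: int, num_modules: int, length: int, start: int = 0) -> set:
    """Return the set of memory modules hit by a strided vector access.

    Assumes low-order interleaving: address x lives in module x % num_modules.
    """
    return {(start + i * stride) % num_modules for i in range(length)}


# With 8 modules and an 8-element vector:
unit_stride = modules_touched(1, 8, 8)    # every module used once
even_stride = modules_touched(2, 8, 8)    # only half of the modules are used
worst_stride = modules_touched(8, 8, 8)   # all accesses fall on one module
```

Under this baseline mapping, an access of at least `num_modules` elements touches `num_modules // gcd(stride, num_modules)` distinct modules, so any stride sharing a factor with the module count concentrates the accesses on fewer modules.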

    Filtering directory lookups in CMPs

    Coherence protocols consume a significant fraction of power to determine which coherence action should take place. In this paper we focus on CMPs with a shared cache and a directory-based coherence protocol implemented as a duplicate of the local caches' tags. We observe that a large fraction of directory lookups produce a miss, since the block looked up is not cached in any local cache. We propose to add a filter before the directory lookup in order to reduce the number of lookups to this structure. The filter identifies whether the current block was last accessed as data or as an instruction. With this information, looking up the whole directory can be avoided for most accesses. We evaluate the filter in a CMP with 8 in-order processors with 4 threads each and a memory hierarchy with a shared L2 cache. We show that a filter with a size of 3% of the tag array of the shared cache can avoid more than 70% of all comparisons performed by directory lookups with a performance loss of just 0.2% for SPLASH2 and 1.5% for Specweb2005. On average, the number of 15-bit comparisons avoided per cycle is 54 out of 77 for SPLASH2 and 29 out of 41 for Specweb2005. In both cases, the filter requires less than one read of 1 bit per cycle. Postprint (published version)

    Analysis and simulation of multiplexed single-bus networks with and without buffering

    This paper presents performance issues of a single-bus interconnection network for multiprocessor systems operating in a multiplexed way. Several models are developed and used to evaluate system performance. Comparisons with equivalent crossbar systems are provided. It is shown how crossbar EBW values can be reached and exceeded when appropriate operation parameters are chosen in a multiplexed single-bus system. Another architectural feature is also considered: the use of buffers at the memory modules. With the buffering scheme, memory interference can be reduced so that system performance improves in practice. Peer Reviewed. Postprint (published version)

    Systematic design of two level pipelined systolic arrays with data contraflow

    Many systolic algorithms and related design methodologies have been proposed recently. Frequently, these systolic algorithms do not take practical considerations into account. An evenly distributed load among processing elements, pipelined functional units, and similar properties are desirable features when implementing systolic algorithms. In this paper we present a design methodology in which these features are considered. As an example, the methodology is applied to obtain a problem-size-independent, two-level pipelined 1D systolic algorithm with data contraflow to efficiently solve triangular systems of equations. Peer Reviewed. Postprint (published version)

    El optimizador de bucles del compilador Open64/ORC (parte 2) [The loop optimizer of the Open64/ORC compiler (part 2)]

    Open64 and ORC (Open Research Compiler) are two open-source initiatives based on the SGI Pro64 compiler. Open64 is managed by members of the University of Delaware, and ORC is an extension of the compiler developed by Intel and the Chinese Academy of Science. For more information, see the respective web pages [2] and [1]. SGI Pro64 is a suite of optimizing compilers developed by SGI. It includes C, C++, and Fortran90/95 compilers that follow the Linux IA-64 ABI and API standards. The source files are publicly available and are distributed under the terms of the GNU General Public License. The compiler suite is available to run on Linux IA-32 and IA-64 platforms. This document continues the work begun in the technical reports “Introducción al compilador Open64/ORC” [10] and “El optimizador de bucles del compilador Open64/ORC (parte 1)” [11]. The first describes the components of the compiler and the intermediate representation used as the common interface between them. The second focuses specifically on one of the compiler's components: the loop optimizer. Postprint (published version)

    Source-to-Source transformations for efficient SIMD code generation

    In recent years, there has been much effort in commercial compilers to generate efficient SIMD instruction-based code sequences from conventional sequential programs. However, the few compilers that can automatically use these instructions achieve unsatisfactory results in most cases. Therefore, the code often has to be written manually in assembly language or using compiler built-in functions to achieve high performance. In this work, we present source-to-source transformations that help commercial vectorizing compilers generate efficient SIMD code. Experimental results show that excellent performance can be achieved. In particular, for the problem of matrix product (SGEMM) we achieve almost as high performance as hand-optimized numerical libraries. Our source-to-source transformations are based on the scalar replacement and unroll and jam transformations presented by Callahan et al. In particular, we extend the use of scalar replacement to vectorial replacement and combine this transformation with unroll and jam and outer loop vectorization to fully exploit the vector register level and thus help the compiler generate efficient SIMD code. We show the effectiveness of our proposal experimentally. Peer Reviewed. Postprint (published version)
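
The two base transformations named in the abstract can be illustrated on a matrix product. The sketch below is a scalar-level illustration only: it assumes square matrices stored as lists of lists with an even dimension, the function names are hypothetical, and it does not show the vectorial replacement or outer loop vectorization that the authors add on top. Unroll and jam unrolls the `i` and `j` loops by 2 and fuses the copies into a single `k` loop; scalar replacement then keeps the 2x2 tile of results in local variables instead of re-reading the output array.

```python
def matmul_naive(a, b, n):
    """Reference triple loop: c[i][j] += a[i][k] * b[k][j]."""
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_unroll_jam(a, b, n):
    """Same product with i and j unrolled by 2 and jammed into one k loop."""
    assert n % 2 == 0, "sketch only: no cleanup loop for odd sizes"
    c = [[0.0] * n for _ in range(n)]
    for i in range(0, n, 2):
        for j in range(0, n, 2):
            # Scalar replacement: the 2x2 output tile lives in local variables.
            c00 = c01 = c10 = c11 = 0.0
            for k in range(n):
                # Each loaded operand is reused twice inside the jammed body.
                a0, a1 = a[i][k], a[i + 1][k]
                b0, b1 = b[k][j], b[k][j + 1]
                c00 += a0 * b0
                c01 += a0 * b1
                c10 += a1 * b0
                c11 += a1 * b1
            c[i][j], c[i][j + 1] = c00, c01
            c[i + 1][j], c[i + 1][j + 1] = c10, c11
    return c
```

In a compiled language the local variables would map to registers, which is the point of the transformation: four multiply-adds are performed per four loads, rather than one per two loads in the naive loop.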